Data Integration using Unity Catalog

This topic describes how to create a data ingestion pipeline that reads from various data sources, uses Databricks (Unity Catalog enabled) for data integration, and ingests the data into a Databricks Unity Catalog data lake.

The following data sources are currently supported for a data ingestion pipeline using Databricks Unity Catalog:

  • FTP / SFTP

  • REST API

  • CSV

  • MS Excel

  • Parquet

  • Amazon S3

  • MS SQL

  • MySQL

  • PostgreSQL

  • Oracle

  • Snowflake
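
The connection details for these sources are captured in the pipeline UI, but the reads they map to are standard Spark readers. The following is a minimal sketch of how a few of the listed sources might be read on a Unity Catalog enabled cluster; all hosts, credentials, bucket names, and table names are hypothetical placeholders, not values used by the product.

    # Minimal sketch only: hosts, credentials, paths, and table names below
    # are hypothetical placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # already available as `spark` in Databricks notebooks

    # File-based sources staged on Amazon S3 (CSV / Parquet)
    csv_df = (
        spark.read.option("header", "true")
        .option("inferSchema", "true")
        .csv("s3://example-bucket/landing/customers.csv")
    )
    parquet_df = spark.read.parquet("s3://example-bucket/landing/orders/")

    # JDBC sources (PostgreSQL shown; MS SQL, MySQL, and Oracle are analogous),
    # assuming the matching JDBC driver is available on the cluster.
    pg_df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://example-host:5432/sales")
        .option("dbtable", "public.invoices")
        .option("user", "ingest_user")
        .option("password", "<secret>")  # prefer a secret scope over a literal password
        .load()
    )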

Prerequisites

  • Access to a Databricks node that has Unity Catalog enabled, which will be used as the data integration node in the data ingestion pipeline. The Databricks Runtime version of the cluster must be 14.3 (a quick way to verify this from a notebook is sketched after this list).

  • Access to a Databricks Unity Catalog node, which will be used as the data lake in the data ingestion pipeline.
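
If you want to confirm these prerequisites from a notebook on the integration cluster, a quick check could look like the sketch below. It relies on the DATABRICKS_RUNTIME_VERSION environment variable that Databricks sets on cluster nodes and on the current_catalog() SQL function; treat it as an illustrative check, not part of the product workflow.

    import os

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # predefined as `spark` in Databricks notebooks

    # Databricks sets DATABRICKS_RUNTIME_VERSION on cluster nodes, e.g. "14.3".
    print("Databricks Runtime:", os.environ.get("DATABRICKS_RUNTIME_VERSION", "unknown"))

    # On a Unity Catalog enabled cluster, current_catalog() resolves and
    # catalogs other than hive_metastore are listed.
    spark.sql("SELECT current_catalog() AS catalog").show()
    spark.sql("SHOW CATALOGS").show()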

Creating a data ingestion pipeline

  1. On the home page of Data Pipeline Studio, add the following stages and connect them as shown below:

    • Data Sources

    • Data Integration (Databricks - Unity Catalog enabled)

    • Data Lake (Databricks Unity Catalog)

    For the sake of example, we use FTP in the data source node.

    Databricks Unity Catalog Integration Pipeline

  2. Configure the data source node. If you are using CSV, MS Excel or Parquet as a data source, then do the following:

    1. Click the source node. Select New Datastore and click Browse this computer. Select a file to upload.

    2. In the Storage Type field, select AWS S3 and then select a configured storage instance on S3. You must have access to the configured storage instance to be able to view and select it.

      Note:

      In the data integration stage, if you are using a Databricks instance deployed on AWS, you must select AWS S3 as the storage type. If you are using a Databricks instance deployed on Azure, you must select Azure Data Lake as the storage type.
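
    Functionally, the upload step stages the file on the configured storage instance; once it is there, reading it from Databricks is an ordinary Spark read against the cloud path. The paths below are hypothetical placeholders; only the path scheme differs by cloud, matching the note above.

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      # Placeholders only: bucket/container and account names are illustrative.
      aws_path = "s3://example-configured-bucket/uploads/sales.csv"                   # Databricks on AWS (AWS S3)
      azure_path = "abfss://uploads@exampleaccount.dfs.core.windows.net/sales.csv"    # Databricks on Azure (Azure Data Lake)

      df = spark.read.option("header", "true").csv(aws_path)  # or azure_path
      df.printSchema()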

  3. Configure the data lake node.

    • Click the Use an existing Databricks Unity Catalog dropdown and select an instance. Click Add to data pipeline.

    • Click the Schema Name dropdown and select a schema.

    • Click Data Browsing to browse the folders and view the required files. This step is optional.

    • Click Save.
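
    In Unity Catalog terms, the instance and schema selected here determine the three-level name, <catalog>.<schema>.<table>, that ingested data lands in. The sketch below uses placeholder names and a tiny sample DataFrame to show what the write and the optional browsing step correspond to.

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      catalog = "example_catalog"   # catalog of the selected Unity Catalog instance (placeholder)
      schema = "example_schema"     # schema chosen in the Schema Name dropdown (placeholder)

      sample_df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
      sample_df.write.mode("append").saveAsTable(f"{catalog}.{schema}.customers_raw")

      # Roughly what the optional Data Browsing step surfaces:
      spark.sql(f"SHOW SCHEMAS IN {catalog}").show()
      spark.sql(f"SHOW TABLES IN {catalog}.{schema}").show()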

  4. Click the Databricks node in the data integration stage. Click Create Templatized Job. Complete the following steps to create the job: